test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599
Open
wyuc wants to merge 6 commits into
Open
test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599wyuc wants to merge 6 commits into
wyuc wants to merge 6 commits into
Conversation
6cd1160 to
ca403bc
Compare
135f659 to
f5950ca
Compare
f5950ca to
ca4ad98
Compare
Adds eval/orchestration/ following the outline-language pattern: a runner that A/B-tests the director against the same scenario with two prompt variants — current main (post-#554) and a synthesised pre-#554 baseline that strips the role-aware summary labels AND the new system.md rules 10/11/12 together. Pass criterion is now framed as a regression guard rather than a fixture discrimination test: every scenario's post-fix END rate must stay below EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as informational data, since #554's reviewer feedback was that earlier fixtures didn't discriminate. Empirical finding across 7 shipped model configs (gpt-4.1-mini, gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat, deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft pushback after a long resolved-looking discussion, topic pivot, brief acknowledgement, and explicit teacher-signals-end-then-user-objects): no scenario produced a non-zero END rate in either variant. Useful data for the #554/#598 discussion — the prompt-layer rules don't measurably change behavior on these shipped models with these prompts, but the eval is now wired so future regressions show up.
…w-up)
Adds an A/B eval probing whether the director routes correctly when the
conversation contains an unanswered user question — both in the first-turn
drift case (agents started answering but drifted onto adjacent topics,
no frustration yet) and in the escalated frustration case (user has
already complained multiple times).
The prod symptom this targets ("答非所问"): agents reply on adjacent
topics rather than answering the literal question, and the director
keeps picking peer agents for "variety" instead of routing back to
the teacher to actually answer.
Scenarios: 5 synthesized cases across math, biology, English grammar,
physics, and calculus. 3 are first-turn drift (no frustration yet),
2 include frustration signals. Each captures: user asks specific
question, agents reply on adjacent topics, director is asked to pick
the next move.
Decision rule-judged into USER | TEACHER | OTHER_AGENT | END:
- USER → ✓ valid (cue user to clarify)
- TEACHER → ✓ valid (re-route to teacher to answer)
- OTHER_AGENT → ✗ wrong (peer-agent "variety" routing — the bug)
- END → ✗ wrong
A/B:
- baseline : current director template with rule 13 stripped
- with_rule : current director template as-shipped
Pass = with_rule.correctRate ≥ EVAL_PASS_THRESHOLD (default 0.7).
Pre-vs-post Δ is reported as informational only.
Note this commit only adds the eval; the rule it strips/restores in
the A/B is injected in the next commit.
Adds rule 13 to lib/prompts/templates/director/system.md so the director recognizes when the conversation contains an unanswered user question and routes appropriately — to the teacher (literal role match) by default, or to USER cue when the question is ambiguous or the user has already complained multiple times. This addresses the "答非所问" symptom: when agents drift onto adjacent topics, the director keeps picking peer agents for "variety" rather than routing back to the teacher to actually answer. The rule is preventive (applies whenever the question is unanswered, including the first turn after drift) rather than reactive-only (waiting for the user to express frustration). Rule wording uses constrained-output framing: literal `role: teacher` match (not "best fit" / "highest priority" / "most knowledgeable"), and an explicit list of forbidden choices (peer agents, END). Earlier softer wordings let the model pick the assistant when it seemed more topically relevant; the literal-match wording prevents that. Validated by the question-answering eval added in the previous commit. On google:gemini-3-flash-preview, 5 samples per variant across 5 synthesized scenarios: | Scenario | baseline | with_rule | |----------------------------------------------------|----------|-----------| | math_quadratic_axis_drift_no_frustration | 100% | 100% | | bio_dark_reaction_drift_no_frustration | 80% | 100% | | english_team_isare_drift_no_frustration | 20% | 100% | | physics_inertial_mass_escalated_frustration | 0% | 100% | | calculus_product_rule_drift_no_frustration | 60% | 100% | | **mean** | **52%** | **100%** | 3 of 5 scenarios reproduce the bug at baseline (correct rate ≤ 60%); the rule lifts every scenario to 100%. The remaining 2 baselines are already high — the rule doesn't regress them.
Adds an "Answering the User's Question" section to the agent system
prompt. Before this, the agent SP had no rule for direct-answering — the
persona, role guideline, current-state context, and length guidelines all
pushed the agent to advance the lesson and inspire thought, with no
explicit directive to first answer the user's literal question.
The result observed in prod ("答非所问"): when the user asks a specific
question (a formula, a yes/no, a term), agents reply with adjacent
concepts, examples, or follow-up questions — but never the literal
answer.
The new section directs the agent, on every response where the user
message contains a question, to:
- lead with the concrete answer to the literal question,
- not pivot to an adjacent topic even when it seems pedagogically
richer,
- treat "inspire thought" and peer-differentiation guidance as
applying only AFTER the literal question is answered,
- say "I don't know" directly when uncertain rather than answering
a different question,
- on frustration signals from prior turns, look back at the original
question (before the frustration) and answer THAT specifically.
This is the agent-side counterpart to the director rule landed in the
previous commit. Director routes to the teacher when the question is
unanswered; the teacher now has explicit guidance for what to do when
dispatched.
…ors from completion Two coupled cleanups in one commit since the outcome-handler rewrite naturally falls out of removing the client-side max-turns loop. The director graph used `maxTurns: turnCount + 1` to force "one director→agent cycle per request" — a hack expressing the single-round contract as a turn limit. This commit removes the hack and the surrounding user-facing setting, replacing implicit intent with explicit graph topology. Server: - `director-graph.ts`: drop `maxTurns` from state, drop the hard-cap check at the top of `directorNode`, change `agent_generate → director` edge to `agent_generate → END`. Single-round contract is now expressed by topology. - `lib/types/chat.ts`: drop `maxTurns` and `currentTurn` from `SessionConfig`; drop `maxTurns?` from `CreateSessionRequest.trigger`. Client: - `lib/store/settings.ts`: drop `maxTurns` state, `setMaxTurns` action, the `'10'` default, and the localStorage migration path. - `components/agent/agent-bar.tsx`: drop the entire compact-stepper UI block plus the unused MessageSquare/Minus/Plus lucide imports. - `components/settings/agent-settings.tsx`: drop the AgentSettings prop and the multi-agent-only Input block. - `lib/chat/agent-loop.ts`: drop `maxTurns` parameter, change `while (turnCount < maxTurns)` to `while (true)` controlled by inner exits, drop `'max_turns'` reason from `AgentLoopOutcome`. - `components/chat/use-chat-sessions.ts`: drop the maxTurns lookup, drop the trailing arg to runAgentLoop, drop vestigial `maxTurns: 0` / `currentTurn: 0` from session creation. - `eval/whiteboard-layout/runner.ts`: drop unused MAX_AGENT_TURNS plus the trailing arg. i18n: drop `settings.maxTurns` and `settings.maxTurnsDesc` from six locales (en-US, zh-CN, zh-TW, ja-JP, ar-SA, ru-RU). Motivation for removing the user-facing setting: the LLM director controls round length via cue_user / END decisions, so a numeric "max discussion turns" knob is redundant with rule-driven routing. When the director prompt was buggy, the cap masked the bug by force- ending; once director routing is fixed, the cap just removes user agency. Before, every non-`cue_user` outcome of the agent loop mapped to `status: 'completed'` in `use-chat-sessions.ts`. That conflated four distinct paths: - `end`: director chose END (legitimate completion) - `empty_turns`: two consecutive agents returned no text or actions (parser failure, model timeout, etc.) - `no_done`: SSE stream ended without a `done` event (infrastructure failure) - `aborted`: user pressed stop (handled elsewhere) The new switch routes `end` to completed, `empty_turns` and `no_done` through `clearLiveSessionAfterError` so the user sees an inline System message describing what went wrong instead of a quiet "completed". - `lib/types/chat.ts`: add `'error'` to `SessionStatus`. - `components/chat/use-chat-sessions.ts`: rewrite the outcome handler as a switch; have `clearLiveSessionAfterError` set `status: 'error'` so all entry points (the runAgentLoop outcome AND existing throw paths from fetchChat / SSE error events) converge on the same visible state. - `components/chat/session-list.tsx`: add `AlertCircle` red icon for the new error status. - i18n: add `chat.error.emptyAgentResponses` and `chat.error.streamInterrupted` strings to all six locales. This is the in-product fix that the prod investigation surfaced as an independent root cause of the #511 "discussion ended unexpectedly" symptom — separate from the director and agent prompt fixes in the earlier commits of this PR.
ca4ad98 to
f3fa658
Compare
…nswering script Two minor CR follow-ups: - components/chat/use-chat-sessions.ts: the `settingsState` lookup at the top of runAgentLoopFn was the entry point for the maxTurns lookup ripped in the previous commit; with maxTurns gone, the binding is unused and ESLint flags it. Drop the line. (Other uses of useSettingsStore.getState() elsewhere in this file are unaffected.) - package.json: the answering eval runner had no convenience script, while the premature-end runner did. Add eval:orchestration:answering for symmetry and discoverability.
Contributor
Author
CR loop summary (2 rounds, converged)Ran 2 rounds of code review via the Round 1 flagged 2 Important issues:
Both fixed in Round 2 verified:
Round 2 verdict: loop converged, no new findings. Out-of-scope items deliberately left for follow-up (called out but not addressed in this PR):
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #598 follow-up items 1 and 2 (and surfaces additional in-product fixes discovered during the prod investigation).
The root symptom this PR addresses: agents drift onto adjacent topics rather than answering the user's literal question ("答非所问"). The director then keeps picking peer agents for "variety" instead of routing back to the teacher to actually answer. Frustration signals from the user are just one sub-case where this is loud enough to notice; the underlying bug fires earlier, often on the first response.
What's in this PR
Commit 1: premature-END regression guard
5 scenarios modelled on #511's described symptoms. A/B across 9 models × 5 scenarios × 5 samples per variant: 0% END rate in both variants on every scenario. The prompt-layer rules added in #554 don't measurably change director END behavior on these inputs. Kept as a regression guard so any future change that raises the END rate trips this eval.
Commit 2: director question-answering eval
5 synthesized scenarios — 3 first-turn drift (no frustration yet), 2 with escalated frustration. Decision rule-judged: USER | TEACHER | OTHER_AGENT | END. A/B: current main vs main with rule 13 stripped.
Commit 3: director rule 13 — Answering the User's Question
Constrained-output rule injected at the end of the # Rules section. Triggers whenever the conversation has an unanswered user question. Constraints: route to LITERAL
role: teacherby default, USER cue when ambiguous or after multiple unresolved complaints, forbid peer-agent "variety" routing and END while the question is open.On google:gemini-3-flash-preview, 5 samples per variant:
3 of 5 scenarios reproduce the bug at baseline (≤60% correct, peer-agent "variety" routing dominates). The rule lifts every scenario to 100%.
Commit 4: agent rule — Answering the User's Question
Adds a section to the agent system prompt that directs the agent, on every response where the user message contains a question, to lead with the concrete answer to the literal question before any other content. Pairs with the director rule — director routes to the teacher when the question is unanswered; the teacher now has explicit guidance for what to do when dispatched.
Commit 5: rip maxTurns end-to-end + distinguish errors from completion
maxTurns rip: the director graph used
maxTurns: turnCount + 1to force "one director→agent cycle per request" — a hack expressing the single-round contract as a turn limit. Replaced with explicit graph topology (agent_generate → END). The user-facing "Max Discussion Turns" setting is also removed: with rule-driven routing (cue_user / END), a numeric cap is redundant and was previously masking director bugs by force-ending.Error vs completion: before, every non-
cue_useroutcome of the client agent loop mapped tostatus: 'completed', conflatingend(legitimate completion) withempty_turns(parser failure / model timeout) andno_done(SSE infra failure). The new switch routes the error paths throughclearLiveSessionAfterError, which appends an inline System message and sets the newstatus: 'error'. The session list gets a redAlertCircleicon for the error state.Touches:
director-graph.ts,lib/types/chat.ts,lib/store/settings.ts,components/agent/agent-bar.tsx,components/settings/agent-settings.tsx,lib/chat/agent-loop.ts,components/chat/use-chat-sessions.ts,components/chat/session-list.tsx,eval/whiteboard-layout/runner.ts, 6 i18n locales.Test plan
pnpm checkcleanpnpm test— no new failures (9 pre-existing ssrf-guard DNS-bound failures unrelated)npx tsc --noEmit -p tsconfig.jsoncleanEVAL_DIRECTOR_MODEL=<model> pnpm eval:orchestration— premature-END eval passes (regression guard)EVAL_DIRECTOR_MODEL=<model> npx tsx --env-file=.env.local eval/orchestration/answering-runner.ts— 5/5 scenarios PASS, mean 100% correct rate with rule, baseline 52%